Reliability of COVID-19 data: An evaluation and reflection

Importance: The rapid proliferation of COVID-19 has left governments scrambling, and several data aggregators are now assisting in the reporting of county cases and deaths. The different variables affecting reporting (e.g., time delays in reporting) necessitate a well-documented reliability study that examines the data methods and discusses possible causes of differences between aggregators.
Objective: To statistically evaluate the reliability of COVID-19 data across aggregators using case fatality rate (CFR) estimates and reliability statistics.
Design, setting, and participants: Cases and deaths were collected daily by volunteers from state and local health departments as primary sources and from newspaper reports as secondary sources. To enable statistical analysis of reliability, BroadStreet collected data from other COVID-19 aggregators, including USAFacts, Johns Hopkins University, the New York Times, and The COVID Tracking Project.
Main outcomes and measures: COVID-19 case and death counts at the county and state levels.
Results: Lower levels of inter-rater agreement across aggregators were observed for the number of deaths, which manifested in state-level Bayesian estimates of COVID-19 fatality rates.
Conclusions and relevance: A national, publicly available data set is needed for current and future disease outbreaks and improved reliability in reporting.


Introduction
In the wake of the COVID-19 (2019 novel coronavirus) pandemic, death rates and spatial mapping, not dissimilar to methods used during the 19th century London cholera outbreak, have become talking points of the 21st century [1,2]. As COVID-19 slowly gained momentum in late winter and early spring of 2020, governments and other organizations scrambled to collect and present temporo-spatial data. When governments, understandably, struggled with the proliferation of COVID-19, many non-governmental organizations and universities helped with the COVID-19 data collection by innovating with data aggregation techniques (e.g., web scraping, crowd-sourcing) [1][2][3][4].
Despite new technology and methods, aggregating COVID-19 data remains difficult and potentially error-prone due to the sheer number of data collection methods and definition disparities, the novelty of the worldwide data tracking process, and the attempt to collect community-level information (e.g., county) [5]. Due to the efforts of multiple aggregators and their various methods, the COVID-19 pandemic provides a unique opportunity to evaluate the statistical reliability of near-real-time, fast-moving infectious disease surveillance data. Thus, a well-documented reliability study examining data collection methods and possible causes of differences between aggregators, as well as methods for correcting these differences, is essential knowledge for future infectious disease outbreaks. This is the goal of this paper: to take a small first step toward evaluating the COVID-19 data collection process.
Despite the validity challenges of relying on surveillance data, as Hartley and Perencevich [6] have pointed out, the value of leveraging continuous COVID-19 data from multiple sources to evaluate public health interventions in near real-time far outweighs the inevitable inaccuracies. In fact, technology has truly transformed the disease surveillance response to COVID-19 [7,8]. Local and federal governments, hospitals, newspaper outlets, universities, the Centers for Disease Control and Prevention (CDC), and other organizations have worked together to aggregate and survey COVID-19 [9]. Once collected and combined by aggregators, COVID-19 data have been used to create geo-maps, other data visualizations, and statistical models to track, predict, and understand the virus [8,10,11]. These aggregated COVID-19 data have helped governments and communities formulate responses, allocate resources, measure the effectiveness of policy interventions (such as stay-at-home orders, mask mandates, and vaccinations), and provide guidance in loosening restrictions [12][13][14][15].
As a result, the COVID-19 data collection and reporting process has transformed not only our ability to surveil infectious disease but also our ability to forecast cases and outcomes under varying government policies [8,11,16,17,18]. This has been integral in preventing disease and saving lives [19]. While new data and technology solutions have been invaluable in easing uncertainty and providing facts in a sea of unknowns, much of the data remains unavailable at resolutions finer than the county level [1,7]. There is also the challenge of aggregating data across government agencies that collect and report their data differently [20][21][22][23]. Information and transparency in the U.S. related to data tracking methods are often limited and varied [24], as there are numerous independently established systems for reporting disease cases and deaths [6,9]. Thus, numerous challenges remain with collecting and aggregating reliable COVID-19 data [24,25].
In an effort to help move "real-time" disease surveillance forward, this study, done by BroadStreet in conjunction with the COVID-19 Data Project [26], attempts to make several contributions. First, we describe our COVID-19 data collection process and observations of the process, which has not previously been documented for COVID-19. Second, we examine the reliability of several COVID-19 data aggregators, including the CDC-endorsed USAFacts (USAF) [27], Johns Hopkins University (JHU) [28], New York Times (NYT) [29], The COVID Tracking Project (CTP) [30], and BroadStreet (BS) [26], at the national, state, and county levels. These aggregators were chosen because their data collection methods and data sets are publicly available and are cited and used by a variety of organizations (e.g., CDC and Google). Understanding the differences between these aggregators will allow researchers to focus on, and potentially develop, more reliable disease tracking and data collection methods. Lastly, it is essential to examine how COVID-19 reporting differences may be manifested in commonly used tracking statistics; thus, we examined the case fatality rate estimates at the state level.

Data collection process
Starting on March 16th, 2020, the BroadStreet team (consisting of approximately 120 volunteers) [31] began tracking cumulative diagnosed cases of, and deaths due to, COVID-19 reported by state and county governments [26]. BroadStreet volunteers were recruited from a variety of universities through public health and other related undergraduate and graduate departments, and they were eligible to participate in this project if they had any interest or experience in a public health-related field. Following CDC guidelines published on 4/5/2020 [32], these volunteers tracked case and death totals using various sources and organized them within Google Sheets. Volunteers were organized into six regional teams consisting of members acting in daily data entry, management, and quality assurance roles. Probable cases are defined by the CDC as being: 1. Diagnosed through epidemiologically linking individuals expressing COVID-19 symptoms to a known case; COVID-19 totals could be entered, as well as an "Unknown County" for cases that could not be assigned to a county. The sources used were official state or local government websites. In some cases these were supplemented with secondary sources, such as news sources, due to infrequent or nonspecific reporting by primary sources. Team managers examined the accuracy of daily totals to identify and correct errors. The Quality Assurance team then compared BroadStreet's county-level cumulative totals to those reported by other aggregators, including the NYT [29], JHU [28], CTP [30], and USAF [27], to check the reliability of entered data. If significant discrepancies between aggregators existed, Quality Assurance performed research to determine the most "accurate" count totals, and then left a comment with the results of their research and any changes. All team members signed off on a tracking sheet after completing their assigned tasks to ensure accountability.
In situations where a decrease in cumulative totals was caused by a one-day anomaly in the totals reported, this was assumed to be a reporting error and the anomalous data were updated to match the following day. In the case of a simple decrease in cumulative totals, if research did not produce an explanation of the cause, then the assumption was made that this was due to cases or deaths being reassigned to a different county, and the historic totals in the initial county were reduced and transferred into an "Unknown County".
To begin the comparison for the statistical reliability analysis, BroadStreet collected data from other COVID-19 aggregator sources, including USAF, NYT, JHU, and CTP [27][28][29][30]. Initial examples of differences in reporting included BroadStreet reporting 89 more counties than other aggregators and thousands of anomalies (a count lower than that of the previous day) in county-level case and death counts. This result is highly biased against counties with rapidly growing cases and deaths and illustrates that reported numbers are not always immediately accurate. Table 1 provides a summary of various data collection methods by the different aggregators.

Data cleaning and preparation
Each aggregator's original dataset was cleaned and pre-processed in the R statistical computing language [35] to generate comparable datasets across aggregators. This included the removal of unknown case or death data, removal of unmatched Federal Information Processing Standards (FIPS) codes, and removal of uncommonly reported US territories and/or smaller geographic delineations. To reduce the negative binomial skew resulting from a significant number of zero-count days (i.e., before COVID-19 was prevalent in a location) while maximizing the inclusion of early count data, a date range of March 15, 2020 through June 30, 2020 was selected. A daily case and death count was calculated using a date's cumulative count minus the preceding date's cumulative count. Negative counts resulting from the daily count calculation for cases and deaths were dropped from the dataset. Negative counts accounted for 0.77% of county case data, 0.21% of county death data, 0.05% of state case data, and 0.15% of state death data. Before conducting the reliability analyses, two other notable changes were made to the data: daily counts were smoothed using a 3-day moving average to account for asynchronous reporting of daily cases by each aggregator, and all daily case and death counts were modified by +1 to remove any remaining zero-count data, which improves Cohen's Kappa coefficient estimates and reduces the number of paradoxical results.
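The preparation steps described above (daily differences, dropping negative counts, a 3-day moving average, and the +1 shift) can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function names, the sample totals, the trailing alignment of the moving-average window, and the choice to smooth over the remaining days after negative counts are dropped are all assumptions made for the example.

```python
from typing import List, Optional


def daily_counts(cumulative: List[int]) -> List[Optional[int]]:
    """Daily count = a date's cumulative count minus the preceding date's.

    Negative differences are returned as None so they can be dropped later.
    """
    daily: List[Optional[int]] = []
    for previous, current in zip(cumulative, cumulative[1:]):
        diff = current - previous
        daily.append(diff if diff >= 0 else None)
    return daily


def smooth_and_shift(daily: List[Optional[int]], window: int = 3) -> List[float]:
    """Drop None entries, take a trailing moving average, then add 1.

    The trailing window is an assumption; the text does not state the alignment.
    """
    kept = [d for d in daily if d is not None]
    smoothed = [
        sum(kept[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(kept))
    ]
    return [value + 1 for value in smoothed]


# Hypothetical cumulative totals for one county; the dip from 5 to 4 is an anomaly.
cumulative = [0, 2, 5, 4, 9, 15]
daily = daily_counts(cumulative)
prepared = smooth_and_shift(daily)
print(daily)
print(prepared)
```

In this sketch the negative day is removed before smoothing, so the window spans the three nearest retained days rather than three calendar days; a calendar-based window would be an equally plausible reading of the procedure.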
The issue of smoothing is significant for several reasons. First, data were extremely unreliable prior to smoothing due to the reporting process and significant outliers in the data; thus, the smoothed results presented here are more positive (i.e., possess higher reliability estimates) from a data aggregator perspective. Second, this finding suggests that government officials and media agencies should utilize and stress moving averages (or some other form of smoothed data) rather than raw new counts, given the raw counts' lack of reliability and therefore potentially varying fatality estimates across data sources. This finding largely explains the significant increases and decreases in counts that are commonly seen in the data and reported by the media. These large changes in numbers are presumably not a function of large fluctuations in new cases or deaths, but instead an artifact of the data reporting process (see Data reporting process below and S1 Fig). This process may create unintentional panic or calm in communities.

Data reporting process
Generally, daily COVID-19 counts are reported from a given data source (e.g., county public health websites) and then extracted by aggregators [36]. The challenge with daily reporting counts is that they depend on many varying factors; for example, the time and date these numbers are reported can be nearly immediate or significantly lagged. This can be seen in Table 2, where the numbers of new cases often differ depending on the day and time these numbers were reported. Each aggregator reports or publishes the same day's counts (e.g., cases and deaths recorded on September 12, 2020) at different times and on different days (e.g., one aggregator reports September 12, 2020 counts at the end of day on September 12, 2020, whereas another aggregator may report September 12, 2020 counts at 8 a.m. on September 13, 2020). For example, in Table 2, both Aggregator 1 and Aggregator 3 report 67 cases on April 30, 2020, whereas Aggregator 2 reported only 19 on April 30, 2020 and the additional 48 cases the following day (May 1, 2020). While the average numbers of cases are comparable across data aggregators, the daily numbers of cases are rarely reliable. Because of daily inconsistencies such as this, reliability estimates were computed and compared using the raw data and a three-day moving average. The higher reliability of the moving average suggests that media and government reports, along with researchers, should consider using a moving average to better represent true trends in COVID-19 cases and deaths. This is especially important in early disease tracking when case count sample sizes are small.
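The effect of a one-day reporting lag can be illustrated with the April 30 example above. The sketch below uses the 67, 19, and 48 case figures from the text; the surrounding baseline values (10 cases on April 29 and 20 on May 1) are hypothetical and were chosen only to make the arithmetic visible.

```python
# Daily new cases for April 29, April 30, and May 1, 2020.
# April 30 and the 48-case carry-over follow the Table 2 example in the text;
# the April 29 and May 1 baseline values (10 and 20) are hypothetical.
aggregator_1 = [10, 67, 20]
aggregator_2 = [10, 19, 20 + 48]

# Day-by-day disagreement between the two aggregators.
daily_gaps = [abs(a - b) for a, b in zip(aggregator_1, aggregator_2)]

# Three-day means over a window that contains both the lagged and catch-up day.
mean_1 = sum(aggregator_1) / len(aggregator_1)
mean_2 = sum(aggregator_2) / len(aggregator_2)

print(daily_gaps)
print(mean_1, mean_2)
```

The daily counts disagree by 48 cases on two consecutive days, while the three-day means coincide. A window that contains only one of the two affected days would still show some disagreement, so smoothing reduces, but does not remove, the effect of asynchronous reporting.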

Statistical analyses
To assess the inter-rater reliability (IRR) of COVID-19 aggregators based on new cases and death counts across the counties, states, and United States, a Kappa variant called linearly weighted Cohen's Kappa (LWCK) was used to examine agreement between paired aggregators due to the discrete nature of the data [37,38]. LWCK inherently takes into account the influence of chance agreement, thus improving the model's sensitivity towards disagreement among rater observation pairs.
After computing LWCK statistics at the county, state, and national levels, choropleth maps were generated at each level to help visualize COVID-19 spatial event density and understand changes in reliability across aggregators and locations. While several standards have been proposed for IRR, this study employed the standard proposed by Cicchetti and Sparrow [39]: Excellent (0.75 to 1.00), Good (0.60 to 0.75), Fair (0.40 to 0.60), and Poor (0 to 0.40).
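For readers unfamiliar with the statistic, a linearly weighted Cohen's Kappa for one pair of aggregators can be sketched as below. This is an illustrative implementation, not the code used in the study: the function names are hypothetical, distinct count values are treated as ordered categories with weights based on their rank distance (an assumption about how counts were categorized), and boundary values such as 0.75 are assigned to the higher band because the published ranges overlap at their edges.

```python
from collections import Counter
from typing import Sequence


def linear_weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    categories = sorted(set(rater_a) | set(rater_b))
    index = {category: i for i, category in enumerate(categories)}
    k = len(categories)
    n = len(rater_a)
    if k < 2:
        return float("nan")  # a single observed category leaves kappa undefined

    pairs = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    row = Counter(index[a] for a in rater_a)  # marginal counts, first rater
    col = Counter(index[b] for b in rater_b)  # marginal counts, second rater

    # Weighted disagreement actually observed.
    observed = sum(abs(i - j) / (k - 1) * count / n for (i, j), count in pairs.items())
    # Weighted disagreement expected by chance from the marginals.
    expected = sum(
        abs(i - j) / (k - 1) * (row[i] / n) * (col[j] / n)
        for i in range(k)
        for j in range(k)
    )
    if expected == 0:
        return float("nan")
    return 1.0 - observed / expected


def cicchetti_sparrow_band(kappa: float) -> str:
    """Map a kappa value to the bands quoted in the text (edges go to the higher band)."""
    if kappa >= 0.75:
        return "Excellent"
    if kappa >= 0.60:
        return "Good"
    if kappa >= 0.40:
        return "Fair"
    return "Poor"


# Hypothetical daily counts from two aggregators for the same four days.
kappa = linear_weighted_kappa([1, 1, 2, 2], [1, 2, 2, 2])
print(kappa, cicchetti_sparrow_band(kappa))
```

In this toy example the observed weighted disagreement is 0.25 and the chance-expected disagreement is 0.5, giving a kappa of 0.5, which falls in the Fair band.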

Bayesian
As discussed below, because it is important to examine how reliability results may manifest themselves in commonly used disease tracking statistics, we estimated the case fatality rate over time (March 15, 2020 to June 15, 2020) using a novel empirical Bayes approach. Utilizing case fatality rates (the number of deaths divided by the number of reported cases over a specified period of time) allowed us to track reporting method changes and how they affected reliability when compared between aggregators. Thus, for each aggregator, we used an empirical Bayes procedure to compute a posterior Beta distribution for each state's case fatality rate, based on the number of cases and deaths reported by the corresponding aggregator.
We start with the assumption that the death counts in each state are independently sampled from beta-binomial distributions with common shape parameters α and β, and with the state-specific number of reported cases. We then estimated the global (U.S.-wide) α and β parameters via maximum likelihood estimation. These shape parameters form the empirical prior for the subsequent analysis, which from this point forward is a simple Bayesian estimation of a binomial proportion. With this common prior, we separately computed the posterior distribution for the case fatality rate for each state, based on that state's case and death counts.
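The procedure described above corresponds to the standard beta-binomial empirical Bayes model, which can be written out as follows. The notation is introduced here for illustration and is not taken from the paper: $d_s$ is the number of reported deaths in state $s$, $n_s$ the number of reported cases, $p_s$ the state's case fatality rate, and $B(\cdot,\cdot)$ the Beta function.

```latex
\begin{align*}
d_s \mid p_s &\sim \mathrm{Binomial}(n_s,\, p_s), \qquad p_s \sim \mathrm{Beta}(\alpha, \beta) \\
P(d_s \mid n_s, \alpha, \beta) &= \binom{n_s}{d_s}\, \frac{B(d_s + \alpha,\; n_s - d_s + \beta)}{B(\alpha, \beta)} \\
(\hat{\alpha}, \hat{\beta}) &= \arg\max_{\alpha, \beta > 0} \sum_{s} \log P(d_s \mid n_s, \alpha, \beta) \\
p_s \mid d_s, n_s &\sim \mathrm{Beta}(\hat{\alpha} + d_s,\; \hat{\beta} + n_s - d_s) \\
\mathbb{E}[p_s \mid d_s, n_s] &= \frac{\hat{\alpha} + d_s}{\hat{\alpha} + \hat{\beta} + n_s}
\end{align*}
```

Under this formulation, states with few reported cases are pulled toward the national prior mean $\hat{\alpha} / (\hat{\alpha} + \hat{\beta})$, while states with many cases are dominated by their own counts; credible intervals follow from quantiles of the posterior Beta distribution.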

State level
USAF, NYT, and JHU yielded the highest inter-rater LWCK reliability across all states for both the number of cases and deaths when examining each pair (see Fig 1) and for the average of these aggregator pairs (see Table 3). Inter-rater reliability between aggregators in the form of LWCK for state daily counts can be seen in Fig 1, where higher agreement is represented by the darkest green (closest to 1.00) and less agreement is portrayed by the lightest green (0.00). Average reliability for cases and deaths across all aggregators was 0.86 and 0.66, respectively.
Taking a deeper look into the data, high (defined here as kappa ≥ 0.90) mean inter-rater reliability averaged across aggregator pairings (see Table 4) was observed across all aggregator state case comparisons for LA, VA, SD, AZ, CT, MD, NJ, and FL when examining the number of cases. Further, the average reliability for deaths was high for only SD, ME, OK, and CT. Note that several states (i.e., RI, OK, MI, NV, and KS) had unacceptable average reliability statistics (defined here as kappa ≤ 0.70) associated with the number of cases, and reliability was even worse for the death rates of 29 states (see Table 4). Comparing states that have high versus low mean inter-rater reliability averaged across aggregator pairings is important for better overall understanding and for replicating the reporting practices associated with high reliability.
Based on these results and other results provided in Table 4, the case count data were consistently more reliable than the death count data. Further, depending on the aggregator, the variance and range in reliability within a state can be large, regardless of whether it is the number of cases or deaths. Although several examples exist, for the average number of cases and death data, CTP had an average reliability of 0.76 within NY whereas BS only had a reliability of 0.23 within NY. This is important to note as it suggests that reliability is a function of both the aggregator and state.

County level
The maps in Figs 2 and 3 provide reliability statistics at the county level for each aggregator pair based on the number of cases and deaths, respectively. A higher rate of agreement in both figures is represented by dark purple (1.00), whereas a lower rate of agreement is represented by lime green (0.00). The presented results suggest that county-level reliability appears clustered within states, rather than scattered throughout the country. Perhaps more interesting is that certain pairs of aggregators are more reliable in some counties than others, thus pointing to the concern that the data collection methods used may result in significantly different conclusions (both for political and research purposes) at local levels. While it is clear from Table 4 that the level of reliability is often state dependent, Figs 2 and 3 demonstrate there is also significant variation within states.
Table 4 note: Kappa can produce negative values; 0 is random agreement among raters, 1 is complete agreement, and less than 0 is generally interpreted as "no agreement" [40].

Bayesian
It is essential to examine case fatality rate (CFR) estimates and not just reliability statistics to explore how reliability estimates may translate into actual statistics used to track COVID-19. Since the empirical Bayes approach was repeated over time, we can see how the estimate for case fatality rate develops as more data becomes available (Fig 4). Table 4), which results in an aggregate fatality range of 0.063 to 0.082 (26.20% difference). If we go back and look at CFR estimate differences across aggregators for NY we see: 0.008 to 0.016 (119.21% difference) on March 15, 2020; 0.050 to 0.076 (34.28% difference) on April 15, 2020; 0.065 to 0.083 (25.03% difference) on May 15, 2020; and 0.064 to 0.082 (24.68% difference) on June 15, 2020. This also means our confidence in estimates will be low early on, especially when sample sizes are small.
Looking at AK credible intervals (CI), its smallest lower CI is 0.007 and its largest upper CI is 0.024, which are fairly large differences for potential point estimates of fatality rates. What this implies is that low reliability, assessed by estimating the inter-rater reliability between aggregators, can lead to significant differences across aggregators in calculated disease tracking statistics, particularly early in a pandemic.

Discussion
This study compared the reliability of COVID-19 case and death count data between data aggregators at the national, state, and county levels. As expected, given the larger sample sizes, reliability for both cases and deaths was higher at the national level across aggregators than at state and county levels. However, death count reliability was typically lower than reliability for reported cases. Variation in reliability remained across aggregators and suggests that aggregator choice could have a significant impact on any data analysis or subsequent action based on the data.
These differences can partially be explained by the intended purposes and collection methods of each aggregator, making it ever more important that aggregators clearly define data collection methods and the terms used (i.e., cases and deaths). USAF [27], JHU [28], and NYT [29] have been reporting daily cases since early in the pandemic and attempt to publish case and death totals in near "real-time" or as they are reported. They occasionally use county-level health departments as a source in instances where the state health department is lagging significantly behind them. Conversely, CTP [30] exclusively uses state health departments as a source to ensure their data are consistent. BroadStreet has emphasized updating data to ensure it is following a logical trend and, when possible, updating historic data to reflect more epidemiologically significant dates, such as date of death or date of symptom onset. These disparate approaches are important for different reasons (e.g., historical accuracy for retrospective analysis) and likely caused some differences in inter-rater reliability. This level of agreement was assessed by estimating the inter-rater reliability between aggregators.
State-level reliability (Fig 1 and Table 4) demonstrated notable findings as well. First, reliability is not equal across states, suggesting that individual state practices and policies considerably influence the data's reliability. Second, some reliability estimates are extremely poor, suggesting that analysis of these data could produce inconsistent findings and biased results or inferences. These results imply the need for standardization of collection and reporting methods across states, which would increase both the reliability and validity of the data. Ideally, this could also be done on a national level, enhancing the data's reliability and validity while also guaranteeing that the data, along with the cohesive reporting methods, are made publicly available.
USAF, JHU, NYT, CTP, and BS use various data collection and quality assurance methods, as well as sources [26][27][28][29][30]. Using state public health departments as a data source requires less maintenance and is more sustainable, but county health departments tend to be more up-to-date than state health departments, which may be of paramount importance early in a pandemic. County health departments also report data based on Council of State and Territorial Epidemiologists (CSTE) guidelines and case definitions, which can help avoid including unreasonable cases in data [41]. However, this may potentially cause datasets to include duplicate cases and cases with an unreasonable definition, making the sum of all counties overestimate the national total.
While data scraping is, in principle, a quick and accurate way to enter data, it is fraught with technical challenges (e.g., updates to web sites break the scraping code) and, in our experience, is not yet more feasible, quicker, or more accurate than manual crowdsourced data entry. Moreover, data scraping still requires human monitoring; without it, data may be missed when a health department alters its website. Likewise, a pipeline may fail to fetch data if the health department changes how its website is formatted; this is especially a potential issue if volunteers are not expecting a particular county to update daily and so may not notice a fetching issue. Updating historical data to include cases and deaths by date of symptom onset or death provides significant information when analyzing the spread of the virus and the effectiveness of preventative measures.
With regard to case and death data, the NYT attributes cases to the location where they are being treated, which may provide a more accurate picture of how the virus is spreading in particular counties and states compared to using solely the date reported. Despite this, the information they use to assign cases to counties is inconsistently provided, and the data may include out-of-state visitors in state totals; if these cases are also reported back to the home state, they will be counted twice at the national level.

Limitations
As a result of the challenges posed in the many stages of data analysis, the reliability and validity of these statistics are critical when creating policies to protect the public and accurately modeling the disease. Disease data validity is imperative and should be the primary objective for any institute, as without validity there can be no reliability. Given that validity cannot be assessed without significant agency and/or government oversight, this study sought to evaluate COVID-19 data reliability, providing insight into the consistency of data across different sources. Due to this, the first limitation of this study is the lack of validity being addressed.
Fundamentally, the validity of any statistical analysis is based on the quality of data collected [39,41,42,43,44,45,46]. Moreover, it is critical that aggregators are transparent in their data collection process so users can judge the validity of their process and can understand discrepancies in numbers across data collection sources. An important caveat is the validity of the final data source is largely dependent on the initial sources providing the data (e.g., state officials and hospitals). For this reason, it is critical that mechanisms are also put in place to evaluate the reliability and validity of data sources at this level. Unfortunately, to our knowledge there is currently no mechanism in place to evaluate this process or the accuracy of the data collected [2], resulting in uncertainty regarding the exact reasoning for data discrepancies across certain states and counties.
Due to validity concerns, several findings were clear when evaluating the reliability of reported daily cases and deaths across aggregators. First, a 3-day moving average is likely needed to ensure reliability across aggregators and eliminate large spikes or dips in the data associated with validity issues. To account for this, Cohen's Kappa was used, though with limitations, such as the potential for paradoxical coefficients, in which high agreement yields zero-value, negative, or abnormally low coefficients [47,48]. Modifying the moving average counts by +1 did improve overall kappa performance; however, a handful of paradoxical results still occurred. Moreover, reliability was not examined for changes over time, and the data in this study only extend to June 15, 2020. Reporting methods may have changed since this study was conducted, which may result in some inaccuracies.
While it would be ideal if cases were reported using the date of infection or death rather than when the event was reported and all aggregators used these dates, this was not the case and frequently resulted in significant spikes on certain days (e.g., cases often dropped over the weekend and spiked on Monday or Tuesday, with the level of these spikes often being agency or county dependent). While the aforementioned example associated with data spikes certainly impacts the data's validity, one should be cognizant that it should not impact the reliability (i.e., aggregators should reliably report those spikes). With that said, it is clear from our evaluation of the aggregators' data that the practices applied across aggregators are not consistent (Table 3); thus, practices should be put in place to increase reliability rather than relying on data smoothing methods to reduce the impact of inconsistent reporting.

Conclusions and relevance
The primary conclusion from this study is that the United States needs a national public data reporting system that is free from the inconsistencies and data discrepancies that result from decentralized data collection and aggregation. The technology to make this happen currently exists. More than 95% of U.S. hospitals use an Electronic Health Record system [5], which can be integrated into a near-real-time data reporting infrastructure to share data between local, state, and national public health agencies. Additionally, the CDC maintains a National Notifiable Diseases Surveillance System (NNDSS), which is used to aggregate data on nationally reportable and notifiable diseases. COVID-19 data are submitted electronically to the CDC by state or jurisdictional health departments via the COVID-19 Electronic Laboratory Reporting system. However, participation in reporting to the NNDSS varies widely between states because participation in the program is entirely voluntary. The problem, therefore, lies in the initial collection and eventual reporting of data from the states. Differences in the underlying Infection Fatality Rate (deaths per true number of infections, rather than deaths per detected/reported cases) would cause discrepancies if certain states had a more vulnerable populace than others, due to demographics such as age or socioeconomic status.
Another likely contributor is differences in reporting. The lack of a nation-wide standard for reporting deaths means that different states may be more or less stringent in attributing deaths to the virus. A third possible source for disagreement across states is discrimination in testing. Due to limited availability of testing, some states became more restrictive in providing free tests to the public. Tests in such states were prioritized towards those exhibiting more severe symptoms, and consequently could have introduced case sampling bias towards a higher-risk subset of the greater population of infected individuals. Through future research, different databases and public sources will be incredibly valuable in the tracking and documentation of cases and deaths [49,50].
Standardizing infectious disease data collection and dissemination would empower practitioners to do more linking to other variables and analysis. For example, the Area Deprivation Index is a powerful indicator of many health outcomes [50][51][52]. The CDC reports social inequality and health systems issues as a cause for an increased risk of health and socioeconomic impacts as a result of COVID-19 for these groups [53,54]. Data reporting for race began in early April, with Louisiana being the first to report data [54][55][56]. Immediately, disparities in mortality were noticed, and a June 2020 report by the CDC confirmed this disparity was widespread. Ultimately, the United States needs to nationally mandate explicit methods for reportable infectious diseases. This is a political problem, spanning policy, communication, and public health sectors. Public health funding should be directed toward the development of a national reporting database that clearly identifies COVID-19 cases and fatalities, as well as consistent reporting procedures, effectively modernizing disease reporting in the US.